Abstract

Memory-based learning, keeping full memory of learning material, appears a viable approach to learning NLP tasks, and is often superior in generalisation accuracy to eager learning approaches that abstract from learning material. Here we investigate three partial memory-based learning approaches which remove from memory specific task instance types estimated to be exceptional. The three approaches each implement one heuristic function for estimating exceptionality of instance types: (i) typicality, (ii) class prediction strength, and (iii) friendly-neighbourhood size. Experiments are performed with the memory-based learning algorithm IB1-IG trained on English word pronunciation. We find that removing instance types with low prediction strength (ii) is the only tested method which does not seriously harm generalisation accuracy. We conclude that keeping full memory of types rather than tokens, and excluding minority ambiguities appear to be the only performance-preserving optimisations of memory-based learning.

This research was done in the context of the "Induction of Linguistic Knowledge" research programme, partially supported by the Foundation for Language Speech and Logic (TSL), which is funded by the Netherlands Organization for Scientific Research (NWO). Part of the first author's work was performed at the Department of Computer Science of the Universiteit Maastricht.

1 Introduction

Memory-based learning of classification tasks is a branch of supervised machine learning in which the learning phase consists simply of storing all encountered instances from a training set in memory (Aha, 1997). Memory-based learning algorithms do not invest effort during learning in abstracting from the training data, as eager-learning algorithms (e.g., decision-tree algorithms, rule-induction, or connectionist-learning algorithms (Quinlan, 1993; Mitchell, 1997)) do. Rather, they defer investing effort until new instances are presented. On being presented with an instance, a memory-based learning algorithm searches for a best-matching instance, or, more generically, a set of the k best-matching instances in memory. Having found such a set of k best-matching instances, the algorithm takes the (majority) class with which the instances in the set are labeled to be the class of the new instance. Pure memory-based learning algorithms implement the classic k-nearest neighbour algorithm (Cover and Hart, 1967; Devijver and Kittler, 1982; Aha, Kibler, and Albert, 1991); in different contexts, memory-based learning algorithms have also been named lazy, instance-based, exemplar-based, memory-based, case-based learning or reasoning (Stanfill and Waltz, 1986; Kolodner, 1993; Aha, Kibler, and Albert, 1991; Aha, 1997).

Memory-based learning has been demonstrated to yield accurate models of various natural language tasks such as grapheme-phoneme conversion, word stress assignment, part-of-speech tagging, and PP-attachment (Daelemans, Weijters, and Van den Bosch, 1997). For example, the memory-based learning algorithm IB1-IG (Daelemans and Van den Bosch, 1992; Daelemans, Van den Bosch, and Weijters, 1997), which extends the well-known IB1 algorithm (Aha, Kibler, and Albert, 1991) with an information-gain weighted similarity metric, has been demonstrated to perform adequately and, moreover, consistently and significantly better than eager-learning algorithms which do invest effort in abstraction during learning (e.g., decision-tree learning (Daelemans, Van den Bosch, and Weijters, 1997; Quinlan, 1993), and connectionist learning (Rumelhart, Hinton, and Williams, 1986)) when trained and tested on a range of morpho-phonological tasks (e.g., morphological segmentation, grapheme-phoneme conversion, syllabification, and word stress assignment) (Daelemans, Gillis, and Durieux, 1994; Van den Bosch, Daelemans, and Weijters, 1996; Van den Bosch, 1997). Thus, when learning NLP tasks, the abstraction occurring in decision trees (i.e., the explicit forgetting of information considered to be redundant) and in connectionist networks (i.e., a non-symbolic encoding and decoding in relatively small numbers of connection weights) both hamper accurate generalisation of the learned knowledge to new material.

These findings appear to contrast with the general assumption behind eager learning, that data representing real-world classification tasks tends to contain (i) redundancy and (ii) exceptions: redundant data can be removed, yielding smaller descriptions of the original data; some exceptions (e.g., low-frequency exceptions) can (or should) be discarded since they are expected to be bad predictors for classifying new (test) material. However, neither redundancy nor exceptionality can be computed trivially; heuristic functions are generally used to estimate them (e.g., functions from information theory (Quinlan, 1993)). The lower generalisation accuracies of both decision-tree and connectionist learning, compared to memory-based learning, on the above-mentioned NLP tasks, suggest that these heuristic estimates may not hold for data representing NLP tasks. It appears that in order to learn such tasks successfully, a learning algorithm should not forget (i.e., explicitly remove from memory) any information contained in the learning material: it should not abstract from the individual instances.

An obvious type of abstraction that is not harmful for generalisation accuracy (but that is not always acknowledged in implementations of memory-based learning) is the straightforward abstraction from tokens to types with frequency information. In general, data sets representing natural language tasks, when large enough, tend to contain considerable numbers of duplicate sequences mapping to the same output or class. For example, in data representing word pronunciations, some sequences of letters, such as ing at the end of English words, occur hundreds of times, while each of the sequences is pronounced identically, viz. /ɪŋ/. Instead of storing all individual sequence tokens in memory, each set of identical tokens can be safely stored in memory as a single sequence type with frequency information, without loss of generalisation accuracy (Daelemans and Van den Bosch, 1992; Daelemans, Van den Bosch, and Weijters, 1997). Thus, forgetting instance tokens and replacing them by instance types may lead to considerable computational optimisations of memory-based learning, since the memory that needs to be searched may become considerably smaller.
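The token-to-type abstraction described above can be illustrated with a minimal Python sketch. The function name, the toy windows, and the placeholder class label "N" are assumptions made for illustration only; they are not taken from the data set used in this paper.

```python
from collections import Counter

def compress_tokens(tokens):
    """Map each distinct (feature_vector, class_label) pair to its frequency."""
    return Counter(tokens)

# Toy instance tokens: (seven-letter window, class label).
# The windows and the label "N" are illustrative placeholders.
toy_tokens = [
    ("oking__", "N"),
    ("aking__", "N"),
    ("oking__", "N"),
    ("oking__", "N"),
    ("aking__", "N"),
]

# Five tokens collapse into two types; the counts keep the frequency information.
toy_types = compress_tokens(toy_tokens)
```

Because the counts are retained, a classifier can still prefer frequent types over rare ones, which is why no information relevant to classification is lost.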



Given the safe, performance-preserving optimisa- 
tion of replacing sets of instance tokens by instance 
types with frequency information, a next step of in- 
vestigation into optimising memory-based learning 
is to measure the effects of forgetting instance types 
on grounds of their exceptionality, the underlying 
idea being that the more exceptional a task instance 
type is, the more likely it is that it is a bad predic- 
tor for new instances. Thus, exceptionality should in 
some way express the unsuitability of a task instance 
type to be a best match (nearest neighbour) to new 
instances: it would be unwise to copy its associated 
classification to best-matching new instances. In this 
paper, we investigate three criteria for estimating 
an instance type's exceptionality, and removing in- 
stance types estimated to be the most exceptional 
by each of these criteria. The criteria investigated 
are 

1. typicality of instance types; 

2. class prediction strength of instance types; 

3. friendly-neighbourhood size of instance types; 

4. random (to provide a baseline experiment). 

We base our experiments on a large data set of English word pronunciation. We briefly describe this data set, and the way it is converted into an instance base fit for memory-based learning, in Section 2. In Section 3 we describe the settings of our experiments and the memory-based learning algorithm IB1-IG with which the experiments are performed. We then turn to describing the notions of typicality, class-prediction strength, and friendly-neighbourhood size, and the functions to estimate them, in Section 4. Section 5 provides the experimental results. In Section 6 we discuss the obtained results and formulate our conclusions.

2 The word-pronunciation data 

Converting written words to stressed phonemic transcription, i.e., word pronunciation, is a well-known benchmark task in machine learning (Stanfill and Waltz, 1986; Sejnowski and Rosenberg, 1987; Dietterich, Hild, and Bakiri, 1990; Wolpert, 1990). We define the task as the conversion of fixed-sized instances representing parts of words to a class representing the phoneme and the stress marker of the instance's middle letter. To generate the instances, windowing is used (Sejnowski and Rosenberg, 1987). Table 1 displays example instances and their classifications generated on the basis of the sample word booking. Classifications, i.e., phonemes with stress markers (henceforth PSs), are denoted by composite labels. For example, the first instance in Table 1, ___book, maps to class label /b/1, denoting a /b/ which is the first phoneme of a syllable receiving primary stress. In this study, we chose a fixed window width of seven letters, which offers sufficient context information for adequate performance, though extension of the window decreases ambiguity within the data set (Van den Bosch, 1997).
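As an illustration of the windowing step, the following Python sketch generates the seven-letter windows for a word, one per letter, with that letter in the middle position. The function name and the use of "_" as the padding symbol are assumptions made for this sketch; the class labels are not generated, since they require the phonemic transcription.

```python
def window_instances(word, width=7, pad="_"):
    """Return one fixed-width letter window per letter of `word`.

    Each window is centred on one letter; positions outside the word
    are filled with the padding symbol.
    """
    half = width // 2
    padded = pad * half + word + pad * half
    return [padded[i:i + width] for i in range(len(word))]

# The sample word from the text yields seven windows, the first being "___book".
booking_windows = window_instances("booking")
```

Each window would then be paired with the phoneme and stress marker of its middle letter to form one instance.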

The task, henceforth referred to as GS (Grapheme-phoneme conversion and Stress assignment), is similar to the NETtalk task presented by Sejnowski and Rosenberg (1986), but is performed on a larger corpus of 77,565 English word-pronunciation pairs, extracted from the CELEX lexical data base (Burnage, 1990). Converted into fixed-sized instances, the full instance base representing the GS task contains 675,745 instances. The task features 159 classes (combined phonemes and stress markers).

3 Algorithm and experimental setup 
3.1 Memory-based learning in IB1-IG 

In the experiments reported here, we employ IB1-IG (Daelemans and Van den Bosch, 1992; Daelemans, Van den Bosch, and Weijters, 1997), which has been demonstrated to perform adequately, and better than eager-learning algorithms on the GS task (Van den Bosch, 1997). IB1-IG constructs an instance base during learning. An instance in the instance base consists of a fixed-length vector of n feature-value pairs (here, n = 7), an information field containing the classification of that particular feature-value vector, and an information field containing the occurrences of the instance with its classification in the full training set. The latter information field thus enables the storage of instance types rather than the more extensive storage of identical instance tokens. After the instance base is built, new (test) instances are classified by matching them to all instance types in the instance base, and by calculating with each match the distance between the new instance X and the memory instance type Y, $\Delta(X, Y)$, using the function given in Eq. 1:

$$\Delta(X, Y) = \sum_{i=1}^{n} W(f_i)\,\delta(x_i, y_i) \qquad (1)$$

where $W(f_i)$ is the weight of the ith feature, and $\delta(x_i, y_i)$ is the distance between the values of the ith feature in the instances X and Y. When the values of the instance features are symbolic, as with the GS task (i.e., feature values are letters), a simple distance function is used (Eq. 2):

$$\delta(x_i, y_i) = 0 \ \text{if}\ x_i = y_i,\ \text{else}\ 1 \qquad (2)$$

The classification of the memory instance type Y with the smallest $\Delta(X, Y)$ is then taken as the classification of X. This procedure is also known as 1-NN, i.e., a search for the single nearest neighbour, the simplest variant of k-NN (Devijver and Kittler, 1982).
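A minimal Python sketch of the weighted distance of Eq. 1, the overlap function of Eq. 2, and the resulting 1-NN search is given below. The function names, the toy instance base, the placeholder class labels, and the feature weights are all assumptions for illustration; in IB1-IG the weights would be information-gain values computed from the training set.

```python
def overlap(x_i, y_i):
    """Eq. 2: distance 0 for identical symbolic values, 1 otherwise."""
    return 0 if x_i == y_i else 1

def weighted_distance(x, y, weights):
    """Eq. 1: sum of per-feature overlap distances, each scaled by its weight."""
    return sum(w * overlap(a, b) for w, a, b in zip(weights, x, y))

def nearest_type(x, instance_base, weights):
    """Return the (features, label, count) type with the smallest distance to x."""
    return min(instance_base, key=lambda t: weighted_distance(x, t[0], weights))

# Hypothetical weights: the middle letter matters most, the outer context least.
toy_weights = [0.1, 0.2, 0.5, 1.0, 0.5, 0.2, 0.1]

# Hypothetical instance types: (seven-letter window, class label, occurrence count).
toy_base = [("___boot", "A", 3), ("___cook", "B", 1)]

# "___book" differs from "___boot" only in the last (low-weight) letter,
# but from "___cook" in the middle (high-weight) letter.
best = nearest_type("___book", toy_base, toy_weights)
```

This sketch ignores distance ties between types of different classes; the tie-handling rule of the implementation used here is described later in this section.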



The weighting function of IB1-IG, $W(f_i)$, represents the information gain of feature $f_i$. Weighting features in k-NN classifiers such as IB1-IG is an active field of research (cf. (Wettschereck, 1995; Wettschereck, Aha, and Mohri, 1997), for comprehensive overviews and discussion). Information gain is a function from information theory also used in ID3 (Quinlan, 1986) and C4.5 (Quinlan, 1993). The information gain of a feature expresses its relative relevance compared to the other features when performing the mapping from input to classification.

The idea behind computing the information gain of features is to interpret the training set as an information source capable of generating a number of messages (i.e., classifications) with a certain probability. The information entropy H of such an information source can be compared in turn for each of the features characterising the instances (let n equal the number of features), to the average information entropy of the information source when the value of that feature is known.

Data-base information entropy H(D) is equal to the number of bits of information needed to know the classification given an instance. It is computed by Equation 3, where $p_i$ (the probability of classification i) is estimated by its relative frequency in the training set.

$$H(D) = -\sum_{i} p_i \log_2 p_i \qquad (3)$$



To determine the information gain of each of the n features $f_1 \ldots f_n$, we compute the average information entropy for each feature and subtract it from the information entropy of the data base. To compute the average information entropy for a feature $f_i$, given in Equation 4, we take the average information entropy of the data base restricted to each possible value for the feature. The expression $D_{[f_i = v_j]}$ refers to those patterns in the data base that have value $v_j$ for feature $f_i$, j is the number of possible values of $f_i$, and V is the set of possible values for feature $f_i$. Finally, $|D|$ is the number of patterns in the (sub) data base.

$$H(D_{[f_i]}) = \sum_{v_j \in V} H(D_{[f_i = v_j]})\,\frac{|D_{[f_i = v_j]}|}{|D|} \qquad (4)$$



Table 1: Example of instances generated for the word-pronunciation task from the word booking.

Information gain of feature $f_i$ is then obtained by Equation 5:

$$G(f_i) = H(D) - H(D_{[f_i]}) \qquad (5)$$

Using the weighting function $W(f_i)$ acknowledges the fact that for some tasks, such as the current GS task, some features are far more relevant (important) than other features. Using it, instances that match on a feature with a relatively high information gain are regarded as less distant (more alike) than instances that match on a feature with a lower information gain.
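The information-gain computation of Equations 3 to 5 can be sketched in Python as follows. The function names and the toy data are assumptions for illustration; the sketch operates on instance tokens, so an implementation storing types would weight each type by its occurrence count.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Eq. 3: entropy in bits of the class distribution in `labels`."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(instances, labels, i):
    """Eq. 5: data-base entropy minus the average entropy given feature i (Eq. 4)."""
    by_value = defaultdict(list)
    for x, y in zip(instances, labels):
        by_value[x[i]].append(y)          # partition the data base on feature i
    total = len(labels)
    average = sum(len(sub) / total * entropy(sub) for sub in by_value.values())
    return entropy(labels) - average

# Toy data: feature 0 determines the class completely, feature 1 is constant.
toy_instances = ["ab", "ab", "cb", "cb"]
toy_labels = ["P", "P", "Q", "Q"]

gain_0 = information_gain(toy_instances, toy_labels, 0)
gain_1 = information_gain(toy_instances, toy_labels, 1)
```

In this toy case the first feature receives a gain of one bit and the second a gain of zero, so a mismatch on the first feature would dominate the weighted distance.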

Finding a nearest neighbour to a test instance may result in two or more candidate nearest-neighbour instance types at an identical distance to the test instance, yet associated with different classes. The implementation of IB1-IG used here handles such cases in the following way. First, when two or more best-matching instance types are found associated to different classes, IB1-IG selects the class of the instance type with the highest occurrence in the set of best-matching instance types. In case of occurrence ties, the classification of one of the set of best-matching instance types is selected that has the highest overall occurrence in the training set (Daelemans, Van den Bosch, and Weijters, 1997).
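The two-stage tie-breaking rule can be sketched as below. This is one reading of the description above, not the actual implementation: the function name, the data layout, and the placeholder class labels are assumptions, and the behaviour when both stages tie (here, the first candidate wins) is not specified in the text.

```python
def resolve_tie(best_matches, class_totals):
    """Pick a class among equally distant best-matching instance types.

    best_matches: list of (class_label, occurrence_count), one per type.
    class_totals: mapping class_label -> overall occurrence in the training set.
    """
    # Stage 1: keep the classes of the types with the highest occurrence count.
    top = max(count for _, count in best_matches)
    candidates = [label for label, count in best_matches if count == top]
    # Stage 2: among those, prefer the class that is most frequent overall.
    return max(candidates, key=lambda label: class_totals[label])

# Hypothetical examples with placeholder class labels.
by_occurrence = resolve_tie([("1a", 12), ("2a", 3)], {"1a": 500, "2a": 900})
by_class_total = resolve_tie([("X", 5), ("Y", 5)], {"X": 100, "Y": 250})
```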



3.2 Setup 

We performed a series of experiments in which IB1-IG is applied to the GS data set, systematically edited according to each of the three tested criteria (plus the baseline random criterion) described in the next section. We performed the following global procedure:

1. We partitioned the full GS data set into a training set of 608,228 instances (90% of the full data set) and a test set of 67,517 instances (10%). For use with IB1-IG, which stores instance types rather than instance tokens, the training set was reduced to contain 222,601 instance types (i.e., unique combinations of feature-value vectors and their classifications), with frequency information.

2. For each exceptionality criterion (i.e., typicality, class prediction strength, friendly-neighbourhood size, and random selection),

(a) we created four edited instance bases by removing 1%, 2%, 5%, and 10% of the most exceptional instance types (according to the criterion) from the training set, respectively.

(b) For each of these increasingly edited training sets, we performed one experiment in which IB1-IG was trained on the edited training set, and tested on the original unedited test set.
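The editing step of this procedure can be sketched as follows, assuming that every instance type has already been given a score by one of the criteria, with lower scores meaning more exceptional. The function names, the rounding of the number of removed types, and the seeded random baseline are assumptions made for this sketch.

```python
import random

def edit_instance_base(types, scores, fraction):
    """Remove the given fraction of types with the lowest scores."""
    n_remove = int(round(fraction * len(types)))
    order = sorted(range(len(types)), key=lambda i: scores[i])
    removed = set(order[:n_remove])        # indices of the most exceptional types
    return [t for i, t in enumerate(types) if i not in removed]

def random_scores(n, seed=0):
    """Baseline criterion: arbitrary scores, so removal is effectively random."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# Toy instance base of 100 placeholder types, scored by their own index.
toy_types = list(range(100))
toy_scores = [float(i) for i in range(100)]

edited = {f: edit_instance_base(toy_types, toy_scores, f)
          for f in (0.01, 0.02, 0.05, 0.10)}
baseline = edit_instance_base(toy_types, random_scores(100), 0.10)
```

Each edited instance base would then be used for training, with testing always on the same unedited test set.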

4 Three estimations of exceptionality

We investigate three methods for estimating the 
(degree of) exceptionality of instance types: typ- 
icality, class prediction strength, and friendly- 
neighbourhood size. 

4.1 Typicality 

In its common meaning, "typicality" denotes roughly the opposite of exceptionality; atypicality can be said to be a synonym of exceptionality. We adopt a definition from (Zhang, 1992), who proposes a function to this end. Zhang computes typicalities of instance types by taking both their feature values and their classifications into account (Zhang, 1992). He adopts the notions of intra-concept similarity and inter-concept similarity (Rosch and Mervis, 1975) to do this. First, Zhang introduces a distance function similar to Equation 1, in which $W(f_i) = 1.0$ for all features (i.e., flat Euclidean distance rather than information-gain weighted distance), in which the distance between two instances X and Y is normalised by dividing the summed squared distance by n, the number of features, and in which $\delta(x_i, y_i)$ is given as Equation 2. The normalised distance function used by Zhang is given in Equation 6.

$$\Delta(X, Y) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\delta(x_i, y_i)\right)^2} \qquad (6)$$



The intra-concept similarity of instance X with classification C is its similarity (i.e., 1 - distance) with all instances in the data set with the same classification C: this subset is referred to as X's family, Fam(X). Equation 7 gives the intra-concept similarity function Intra(X) ($|Fam(X)|$ being the number of instances in X's family, and $Fam(X)^i$ the ith instance in that family).

$$Intra(X) = \frac{1}{|Fam(X)|} \sum_{i=1}^{|Fam(X)|} \left(1.0 - \Delta(X, Fam(X)^i)\right) \qquad (7)$$

All remaining instances belong to the subset of unrelated instances, Unr(X). The inter-concept similarity of an instance X, Inter(X), is given in Equation 8 (with $|Unr(X)|$ being the number of instances unrelated to X, and $Unr(X)^i$ the ith instance in that subset).

$$Inter(X) = \frac{1}{|Unr(X)|} \sum_{i=1}^{|Unr(X)|} \left(1.0 - \Delta(X, Unr(X)^i)\right) \qquad (8)$$

The typicality of an instance X, Typ(X), is the quotient of X's intra-concept similarity and X's inter-concept similarity, as given in Equation 9.

$$Typ(X) = \frac{Intra(X)}{Inter(X)} \qquad (9)$$



An instance type is typical when its intra-concept similarity is larger than its inter-concept similarity, which results in a typicality larger than 1. An instance type is atypical when its intra-concept similarity is smaller than its inter-concept similarity, which results in a typicality between 0 and 1. Around typicality value 1, instances cannot be sensibly called typical or atypical; (Zhang, 1992) refers to such instances as boundary instances.
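The typicality function of Equations 6 to 9 can be sketched in Python as below. The function names and toy data are assumptions for illustration. Two further assumptions are made where the text is silent: an instance is counted as a member of its own family, and the typicality is reported as infinite when the inter-concept similarity is zero or no unrelated instances exist.

```python
import math

def zhang_distance(x, y):
    """Eq. 6: normalised distance with unit weights and overlap per feature."""
    mismatches = sum(1 for a, b in zip(x, y) if a != b)   # overlap squared equals overlap
    return math.sqrt(mismatches / len(x))

def typicality(index, instances, labels):
    """Eq. 9: intra-concept similarity (Eq. 7) over inter-concept similarity (Eq. 8)."""
    x, c = instances[index], labels[index]
    family = [1.0 - zhang_distance(x, y)
              for y, l in zip(instances, labels) if l == c]   # includes x itself
    unrelated = [1.0 - zhang_distance(x, y)
                 for y, l in zip(instances, labels) if l != c]
    intra = sum(family) / len(family)
    inter = sum(unrelated) / len(unrelated) if unrelated else 0.0
    return intra / inter if inter > 0 else float("inf")

# Toy data with two-letter instances and placeholder classes "P" and "Q".
toy_instances = ["aa", "ab", "bb", "ba"]
toy_labels = ["P", "P", "Q", "Q"]

typ_aa = typicality(0, toy_instances, toy_labels)
```

Here "aa" is more similar to its own class than to the other class, so its typicality is larger than 1.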

In our experiments, we compute the typicality of all instance types in the training set, order them on their typicality, and remove 1%, 2%, 5%, and 10% of the instance types with the lowest typicality, i.e., the most atypical instance types. In addition to these four experiments, we performed eight further experiments using the same percentages, editing on the basis of (i) instance types' typicality (by ordering them in reverse order, so that the most typical instance types are removed) and (ii) their indifference towards typicality or atypicality (i.e., the closeness of their typicality to 1.0, by ordering them on the absolute value of their typicality minus 1.0). The experiments with removing typical and boundary instance types provide interesting comparisons with the more intuitive editing of atypical instance types.

Table 2 provides four examples each of atypical, boundary, and typical instance types found in the training set. Globally speaking, (i) the set of atypical instances tends to contain foreign spellings of loan words; (ii) there is no clear characteristic of boundary instances; and (iii) instance types with high typicality values often involve instance types of which the middle letters are at the beginning of words or immediately following a hyphen, or high-frequency instance types, or instance types mapping to a low-frequency class that always occurs with a certain spelling (class frequency is not accounted for in Zhang's metric).

4.2 Class-prediction strength 

A second estimate of exceptionality is to measure how well an instance type predicts the class of all instance types within the training set (including itself). Several functions for computing class-prediction strength have been proposed, e.g., as a criterion for removing instances in memory-based (k-NN) learning algorithms, such as IB3 (Aha, Kibler, and Albert, 1991) (cf. earlier work on edited k-NN (Wilson, 1972; Voisin and Devijver, 1987)) or for weighting instances in the EACH algorithm (Salzberg, 1990; Cost and Salzberg, 1993). We chose to implement the straightforward class-prediction strength function as proposed in (Salzberg, 1990) in two steps. First, we count (a) the number of times that the instance type is the nearest neighbour of another instance type, and (b) the number of occurrences that when the instance type is a nearest neighbour of another instance type, the classes of the two instances match. Second, the instance's class-prediction strength is computed by taking the ratio of (b) over (a). An instance type with class-prediction strength 1.0 is a perfect predictor of its own class; a class-prediction strength of 0.0 indicates that the instance type is a bad predictor of classes of other instances, presumably indicating that the instance type is exceptional.
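The two counting steps can be sketched in Python as follows. This is an illustrative reading, not the implementation used in the experiments: the function names, the toy types, the placeholder class labels, and the unit weights are assumptions. The sketch also assumes that each type is matched against the full instance base including itself, that distance ties are won by the type with the higher occurrence count, and that a type which is never selected as nearest neighbour receives strength 0.0.

```python
def overlap_distance(x, y, weights):
    """Weighted overlap distance: sum of weights of mismatching features."""
    return sum(w for w, a, b in zip(weights, x, y) if a != b)

def class_prediction_strength(types, weights):
    """types: list of (features, label, count). Returns one strength per type."""
    times_nearest = [0] * len(types)
    times_correct = [0] * len(types)
    for feats, label, _count in types:
        # Winner: smallest distance; among ties, the highest occurrence count.
        winner = min(range(len(types)),
                     key=lambda j: (overlap_distance(feats, types[j][0], weights),
                                    -types[j][2]))
        times_nearest[winner] += 1                    # count (a)
        if types[winner][1] == label:
            times_correct[winner] += 1                # count (b)
    return [c / n if n else 0.0
            for c, n in zip(times_correct, times_nearest)]

# Toy instance base: the first two types share feature values but not classes.
toy_types = [("algo", "1a", 5), ("algo", "2a", 1), ("alto", "1a", 2)]
strengths = class_prediction_strength(toy_types, [1, 1, 1, 1])
```

In the toy example the less frequent of the two ambiguous types is never selected, not even for itself, and therefore ends up with strength 0.0.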

We computed the class-prediction strength of all instance types in the training set, ordered the instance types according to their strengths, and created edited training sets with 1%, 2%, 5%, and 10% of the instance types with the lowest class prediction strength removed, respectively. In Table 3, four sample instance types are displayed which have class-prediction strength 0.0, i.e., the lowest possible strength. They are never a correct nearest-neighbour match, since they all have counterpart types with the same feature values. For example, the letter sequence algo occurs in two types, one associated with the pronunciation /'æ/ (viz., primary-stressed /æ/, or 1æ in our labelling), as in algorithm and algorithms; the other associated with the pronunciation /"æ/ (viz., secondary-stressed /æ/, or 2æ), as in algorithmic. The latter instance type occurs less frequently than the former, which is the reason that the class of the former is preferred over the minority class. Thus, an ambiguous type with a minority class (a minority ambiguity) can never be a correct predictor, not even for itself, when using IB1-IG as a classifier, which always prefers high frequency over low frequency in case of ties.

instance types

atypical                              boundary                              typical
feature values  class  typicality    feature values  class  typicality    feature values  class  typicality
ureaucr                0.428         cheques         0ks    1.000         oilf            lDi     7.338
freudia         Chi    0.442         elgium_         0-     1.000         etectio         0kf     8.452
_tissue         oj     0.458         la by           0ai    1.000         ow-by-b         0b      9.130
__czech         0-     0.542         manna__         0-     1.000         ng-iron         2aia   12.882

Table 2: Examples of atypical (left), boundary (middle), and typical (right) instance types in the training set. For each instance (seven letters and a class mapping to the middle letter), its typicality value is given.

Table 3: Examples of instance types with the lowest possible class prediction strength (cps) 0.0.

4.3 Friendly-neighbourhood size 

A third estimate for the exceptionality of instance types is counting by how many nearest neighbours of the same class an instance type is surrounded in instance space. Given a training set of instance types, for each instance type a ranking can be made of all of its nearest neighbours, ordered by their distance to the instance type. The number of nearest-neighbour instance types in this ranking with the same class, henceforth referred to as the friendly-neighbourhood size, may range between 0 and the total number of instance types of the same class. When the friendly neighbourhood is empty, the instance type only has nearest neighbours of different classes. The argumentation to regard a small friendly neighbourhood as an indication of an instance type's exceptionality follows from the same argumentation as used with class-prediction strength: when an instance type has nearest neighbours of different classes, it is vice versa a bad predictor for those classes. Thus, the smaller an instance type's friendly neighbourhood, the more it could be regarded exceptional.
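One possible operationalisation of friendly-neighbourhood size is sketched below in Python. The text does not spell out how the ranking is cut off, so this sketch rests on an assumption: the friendly neighbourhood consists of all same-class types that are strictly closer than the nearest type of a different class. Function names, toy data, placeholder classes, and unit weights are likewise illustrative.

```python
def overlap_distance(x, y, weights):
    """Weighted overlap distance: sum of weights of mismatching features."""
    return sum(w for w, a, b in zip(weights, x, y) if a != b)

def friendly_neighbourhood_size(index, types, weights):
    """types: list of (features, label). Counts same-class types that are
    strictly closer than the nearest different-class type (assumed cut-off)."""
    feats, label = types[index]
    others = [t for j, t in enumerate(types) if j != index]
    enemy_distances = [overlap_distance(feats, f, weights)
                       for f, l in others if l != label]
    if not enemy_distances:
        # No other class present: every same-class type counts as friendly.
        return sum(1 for _, l in others if l == label)
    nearest_enemy = min(enemy_distances)
    return sum(1 for f, l in others
               if l == label and overlap_distance(feats, f, weights) < nearest_enemy)

# Toy instance base with three-letter instances and placeholder classes.
toy_types = [("aaa", "P"), ("aab", "P"), ("abb", "P"), ("bbb", "Q"), ("bba", "Q")]
sizes = [friendly_neighbourhood_size(i, toy_types, [1, 1, 1])
         for i in range(len(toy_types))]
```

Under this cut-off, a type whose feature values also occur with another class always gets size 0, since its nearest different-class neighbour lies at distance zero.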

Friendly-neighbourhood size and class-prediction strength are related functions, but differ in their treatment of class ambiguity. As stated above, instance types may receive a class-prediction strength of 0.0 when they are minority ambiguities. Counting a friendly neighbourhood does not take class ambiguity into account; each of a set of ambiguous types necessarily has no friendly neighbours, since they are each other's nearest neighbours with different classes. Thus, friendly-neighbourhood size does not discriminate between minority and majority ambiguities. In Table 4, four sample instance types are listed with friendly-neighbourhood size 0. While some of these instance types without friendly neighbours in the training set (perhaps with friendly neighbours in the test set) are minority ambiguities (e.g., edib 2s), others are majority ambiguities (e.g., edib 18), while others are not ambiguous at all but simply have a nearest neighbour at some distance with a different class (e.g., soiree. Or).

5 Results 

Figure 1 displays the generalisation accuracies in terms of incorrectly classified test instances obtained with all performed experiments. The leftmost point in the Figure, from which all lines originate, indicates the performance of IB1-IG when trained on the full data set of 222,601 types, viz. 6.42% incorrectly classified test instances (when computed in terms of incorrectly pronounced test words, IB1-IG pronounces 64.61% of all test words flawlessly).

Table 4: Examples of instance types with the lowest possible friendly-neighbourhood size (fns) 0, i.e., no friendly neighbours.

The line graph representing the four experiments 
in which instance types are removed randomly can 
be seen as the baseline graph. It can be expected 
that removing instances randomly leads to a degra- 
dation of generalisation performance. The upward 
curve of the line graph denoting the experiments 
with random selection indeed shows degrading per- 
formance with increasing numbers of left-out in- 
stance types. The relative decrease in generalisation 
accuracy is 2.0% when 1% of the training material is 
removed randomly, 3.8% with 2% random removal, 
10.7% with 5% random removal, and 20.7% with 
10% random removal. 

Surprisingly, the only experiments showing lower 
performance degradation than removal by random 
selection are those with class-prediction strength; 
the other criteria for removing exceptional instances 
lead to worse degradations. It does not matter 
whether instance types are removed on grounds of 
their typicality: apparently, a markedly low, neutral, 
or high typicality value indicates that the instance 
type is (on average) important, rather than remov- 
able. The same applies to friendly-neighbourhood 
size: instances with small neighbourhood sizes ap- 
pear to contribute significantly to performance on 
test material. It is remarkable that the largest er- 
rors with 1% and 2% removal are obtained with 
the friendly-neighbourhood size criterion: it appears 
that on average, the instances with few or no nearest 
neighbours are important in the classification of test 
material. 

When using class-prediction strength as removal criterion, performance does not degrade until about 5% of the instance types with the lowest strength are removed from memory. The reason is that class-prediction strength is the only criterion that detects minority ambiguities, i.e., instance types with prediction strength 0.0, that cannot contribute to classification since they are always overshadowed by their counterpart instance types with majority classes, even for their own classification. In the training set, 9,443 instance types are minority ambiguities, i.e., 4.2% of the instance types (accounting for 3.8% of the instance tokens in the original token set).

Thus, among the tested methods for reducing 
the memory needed for storing an instance base in 
memory-based learning, only two are performance- 
preserving while accounting for a substantial reduc- 
tion in the amount of memory needed by IB1-IG: 

1. Replacing instance tokens by instance types ac- 
counts for a reduction of about 63% of mem- 
ory needed to store instances, excluding the 
memory needed to store frequency information. 
When frequency information is stored in two 
bytes per instance type, the memory reduction 
is about 54%. 

2. Removing instance types that are minority am- 
biguities on top of the type/token-reduction ac- 
counts only for an additional memory reduc- 
tion of 2%, i.e., for a total memory reduction 
of 65%; 56% with two-byte frequency informa- 
tion stored per instance. 
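The reported reductions can be checked with a short calculation from the instance counts given in this paper. The storage cost of eight bytes per instance (seven letter bytes plus one class byte) is an assumption of this sketch, not a figure stated in the text; it is chosen because it is consistent with the reported percentages. The constant and function names are likewise illustrative.

```python
TOKENS = 608_228                 # training instance tokens
TYPES = 222_601                  # training instance types
MINORITY_AMBIGUITIES = 9_443     # types with class-prediction strength 0.0
BYTES_PER_INSTANCE = 8           # assumption: 7 letter bytes + 1 class byte

def memory_reduction(n_types, count_bytes):
    """Fraction of memory saved relative to storing all tokens without counts."""
    baseline = TOKENS * BYTES_PER_INSTANCE
    used = n_types * (BYTES_PER_INSTANCE + count_bytes)
    return 1.0 - used / baseline

types_only = memory_reduction(TYPES, 0)                               # about 0.63
types_with_counts = memory_reduction(TYPES, 2)                        # about 0.54
edited_only = memory_reduction(TYPES - MINORITY_AMBIGUITIES, 0)       # about 0.65
edited_with_counts = memory_reduction(TYPES - MINORITY_AMBIGUITIES, 2)  # about 0.56
```

Under this assumption the four values round to 63%, 54%, 65%, and 56%, matching the figures in the list above.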

6 Discussion and future research 



As previous research has suggested (Daelemans, 1996; Daelemans, Weijters, and Van den Bosch, 1997; Van den Bosch, 1997), keeping full memory in memory-based learning of word pronunciation strongly appears to yield optimal generalisation accuracy. The experiments in this paper show that optimisation of memory use in memory-based learning while preserving generalisation accuracy can only be performed by (i) replacing instance tokens by instance types with frequency information, and (ii) removing minority ambiguities. Both optimisations can be performed straightforwardly; minority ambiguities can be traced with less effort than by using class-prediction strength. Our implementation of IB1-IG described in (Daelemans and Van den Bosch, 1992; Daelemans, Van den Bosch, and Weijters, 1997) already makes use of this knowledge, albeit partially (it stores class distributions with letter-window types). We note that with k > 1 (in our tested implementation of IB1-IG, k = 1), removing minority ambiguities may distort performance, since taking more than single nearest neighbours into account may allow for minority ambiguities to play a constructive role in classification.

Our results also show that atypicality, non-typicality, and typicality (Zhang, 1992), and friendly-neighbourhood size are all estimates of exceptionality that indicate the importance of instance types for classification, rather than their removability. As far as these estimates of exceptionality are viable, our results suggest that exceptions should be kept in memory and not be thrown away.

Figure 1: Generalisation errors (percentages of incorrectly classified test instances) of IB1-IG, with increased numbers of edited instances, according to the tested exceptionality criteria atypical, typical, boundary, small neighbourhood, low prediction strength, and random selection. Performances, denoted by points, are measured when 1%, 2%, 5%, and 10% of the most exceptional instance types are edited.

The results of the present study suggest that 
the following questions be investigated in future re- 
search: 

• The tested criteria can be employed as instance weights as in EACH (Salzberg, 1990) and PEBLS (Cost and Salzberg, 1993), rather than as criteria for instance removal. Instance weighting may add relevant information to similarity matching, and may improve IB1-IG's performance rather than just preserving it.

• The tested implementation of IB1-IG performs a k-NN search through instance space with k = 1. When k > 1, IB1-IG's performance may change, as well as the effect of applying the instance-removal techniques tested here. Removing instances from memory may have a less drastic effect when more instance types at more distances are allowed to match a new instance.